Abstract: The multi-armed bandit problem (MBP) is the problem of finding, as accurately and quickly as possible, the
most profitable option from a set of options that give stochastic rewards, by referring to past experiences. Inspired by
the fluctuating movements of a rigid body in a tug-of-war game, we formulated a unique search algorithm that we call the
‘tug-of-war (TOW) dynamics’ for solving the MBP efficiently [1-5]. Cognitive medium access, which refers to multi-user
channel allocation in cognitive radio, can be interpreted as the competitive multi-armed bandit problem (CMBP);
the problem is to determine the optimal strategy for allocating channels to users that yields the maximum total reward
gained by all users [6]. Here we show that it is possible to construct a physical device for solving the CMBP, which we
call the ‘TOW Bombe’, by exploiting the TOW dynamics that exist in coupled incompressible-fluid cylinders. This analog
computing device achieves the ‘socially-maximum’ resource allocation that maximizes the total rewards in cognitive
medium access without paying a huge computational cost that grows exponentially as a function of the problem size.


INTRODUCTION 

Consider two slot machines. The machines have individual
reward probabilities $P_A$ and $P_B$. At each trial, a
player selects one of the machines and obtains some reward,
for example, a coin, with the corresponding probability.
The player wants to maximize the total reward obtained
after a certain number of selections. However, the player
is assumed not to know these probabilities. The multi-armed
bandit problem (MBP) is to determine the optimal strategy
for selecting the machine that yields the maximum reward
by referring to past experiences.

In our previous studies [1-6], we showed that our
proposed algorithm, called the Tug-of-War (TOW) dynamics,
is more efficient than other well-known algorithms such as
the modified ε-greedy algorithm and the modified softmax
algorithm, and comparable to the ‘upper confidence bound
1-tuned (UCB1T) algorithm’, which is known as the best
algorithm among parameter-free algorithms [7]. Moreover,
the TOW dynamics effectively adapts to a changing
environment in which the reward probabilities switch
dynamically. Algorithms for solving the MBP are useful for
various applications, such as cognitive radio [8,9], web
advertising [10], and the Monte-Carlo tree search that is
used for programming computers to play the game of Go
[11,12].

The cognitive medium access problem has recently become
one of the hottest topics in the field of mobile
communications [8,9]. The underlying idea is to allow
unlicensed users (i.e., cognitive users) to access the
available spectrum when the licensed users (i.e., primary
users) are not active. Cognitive medium access is a new
medium access paradigm in which the cognitive users should
not interfere with the licensed users. To avoid interfering
with the primary network, the cognitive users must first
probe each channel to determine whether there are primary
activities in it before transmission.

Figure 1 shows the channel model proposed by Lai et
al. [8,9]. There is a primary network consisting of N
channels, each with bandwidth B. The users in the primary
network operate in a synchronous time-slotted fashion. It
is assumed that, at each time slot, channel i is free with
probability $P_i$. The cognitive users do not know $P_i$ a
priori. At each time slot, the cognitive users attempt to
exploit the availability of channels in the primary network
by sensing the activity in this channel model. In this
setting, a single cognitive user can access only a single
channel at any given time. The problem is to derive an
optimal access strategy for choosing channels that
maximizes the expected throughput obtained by the cognitive
user. This situation can be interpreted as the multi-user
competitive bandit problem (CMBP).

Fig. 1 Channel model.

For simplicity, we consider the minimum CMBP, i.e.,
2 cognitive (unlicensed) users (1 and 2) and 2 channels
(A and B). Each channel i is not occupied by primary
(licensed) users with probability $P_i$. In the MBP
context, we assume that a user accessing a free channel
gets some reward, for example a coin, with probability
$P_i$. Table 1 shows the payoff matrix for users 1 and 2.
When two cognitive users select the same channel, a
collision occurs, and the reward is evenly split between
the colliding users.

In order to develop a unified framework for the design
of efficient, low-complexity cognitive medium access
protocols, we have to seek an algorithm that can obtain the
maximum total rewards (scores) in the CMBP context. To
acquire the maximum total rewards, the algorithm must have
a mechanism that avoids the ‘Nash equilibrium’, which is
the natural consequence for a group of independent selfish
users.

In this study, we demonstrate that overall optimization
(the maximum total rewards) can be achieved by using a
physical device consisting of two kinds of incompressible
fluid in two or more cylinders. We call this analog
computing device the ‘Tug-of-War (TOW) Bombe’ because it is
analogous to the ‘Turing Bombe,’ which is an analog
electric circuit developed by the British army during World
War II for decoding the ‘Enigma code’ of the German army
[13]. If one tries to solve the CMBP for M users and N
channels using a conventional digital computer, it is
necessary to calculate $O(N^M)$ evaluation values at each
iteration; the computational cost for solving the CMBP
therefore grows exponentially with the number of users M.
Nevertheless, the TOW Bombe makes it possible to solve the
problem without paying the exponential computational cost.
At each iteration, the TOW Bombe requires only M
up-and-down operations for controlling the fluid interface
levels in the corresponding cylinders.
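To make the cost comparison above concrete, the following minimal sketch enumerates the joint channel selections that a brute-force evaluation would have to consider. The function name `joint_selections` is hypothetical and introduced here only for illustration; the values M = 3 and N = 5 are those of the example used later in the paper.

```python
from itertools import product

def joint_selections(num_users, num_channels):
    """Enumerate every joint selection: one channel index per user."""
    return list(product(range(num_channels), repeat=num_users))

M, N = 3, 5  # users and channels of the example in Section 3
selections = joint_selections(M, N)
print(len(selections))  # number of joint selections, N**M
print(M)                # up-and-down operations per TOW Bombe iteration
```

The first number grows exponentially with M, whereas the second grows only linearly.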

1. THE TUG-OF-WAR DYNAMICS 

In previous studies [4,6], we proposed the Tug-of-War
(TOW) dynamics. Consider an incompressible fluid in a
cylinder, as shown in Fig. 2. Here, the variable $X_k$
corresponds to the displacement of terminal k from its
initial position, where $k \in \{A, B\}$. If $X_k$ is
greater than 0, we consider that the liquid selects
machine k.

We used the following estimate $Q_k$ ($k \in \{A, B\}$):

$$Q_k(t) = N_k(t) - (1 + \omega) L_k(t). \qquad (1)$$

Here, $N_k$ is the number of times machine k has been played
until time t, $L_k$ is the number of non-rewarded (i.e.,
failed) events in k until time t, and $\omega$ is a
weighting parameter.

The displacement $X_A$ ($= -X_B$) is determined by the
following difference equation:

$$X_A(t) = Q_A(t) - Q_B(t) + \delta(t). \qquad (2)$$

Here, $\delta(t)$ is an arbitrary fluctuation to which the
liquid is subjected. Consequently, the TOW dynamics evolve
according to a particularly simple rule: in addition to the
fluctuation, if machine k is played at time t, +1 or
$-\omega$ is added to $X_k(t-1)$ when the play is rewarded
or non-rewarded, respectively (Fig. 2). We have shown that
these simple dynamics gain more rewards (coins or packet
transmissions) than other popular algorithms for solving
the MBP [1,2].
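The rule of Eqs. (1) and (2) can be sketched as a short simulation. This is a minimal illustration rather than the authors' implementation: the function name `tow_two_armed`, the uniform fluctuation and its `amplitude`, and the probabilities in the usage line are assumptions, since the text only says that the fluctuation is arbitrary.

```python
import random

def tow_two_armed(p_a, p_b, omega, steps, rng, amplitude=0.5):
    """Play a two-armed Bernoulli bandit with the TOW rule of Eqs. (1)-(2)."""
    probs = {"A": p_a, "B": p_b}
    plays = {"A": 0, "B": 0}     # N_k: number of plays of machine k
    failures = {"A": 0, "B": 0}  # L_k: number of non-rewarded plays of machine k
    # Assumed fluctuation: uniform noise in [-amplitude, amplitude].
    x_a = rng.uniform(-amplitude, amplitude)
    rewards = 0
    for _ in range(steps):
        k = "A" if x_a > 0 else "B"  # X_A > 0 selects A, otherwise B (X_B = -X_A)
        plays[k] += 1
        if rng.random() < probs[k]:
            rewards += 1
        else:
            failures[k] += 1
        # Eq. (1): Q_k = N_k - (1 + omega) * L_k
        q = {m: plays[m] - (1 + omega) * failures[m] for m in plays}
        # Eq. (2): X_A = Q_A - Q_B + fluctuation
        x_a = q["A"] - q["B"] + rng.uniform(-amplitude, amplitude)
    return rewards, plays

# Hypothetical probabilities; omega = 1.0 is an illustrative choice.
rewards, plays = tow_two_armed(0.4, 0.6, omega=1.0, steps=1000, rng=random.Random(0))
print(rewards, plays)
```

A rewarded play raises $Q_k$ by 1 and a failed play changes it by $1 - (1 + \omega) = -\omega$, which is the "+1 or $-\omega$" rule stated above.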

1.1. The Tug-of-War Principle 

In this subsection, we derive the learning rules of the
TOW dynamics from a thought experiment, so that we can
obtain the nearly optimal weighting parameter $\omega_0$.
In many popular algorithms such as the ε-greedy algorithm,
an estimate of the reward probability is updated only for
the selected arm. In contrast, we consider the case in
which the sum of the reward probabilities
$\gamma = P_A + P_B$ is given in advance. Then, we can
update both estimates simultaneously as follows,

Here, the top and bottom rows give the estimates based
on $N_A$ selections of A and $N_B$ selections of B,
respectively.
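The two rows of estimates referred to here can be reconstructed from Eq. (3) under one assumption: each machine's reward probability is estimated by its empirical success rate, and the estimate for the other machine follows from the known sum $\gamma$. The notation $\hat{P}_k$ is introduced here for illustration only, and the layout is a sketch rather than the authors' original display.

```latex
\begin{aligned}
\text{from } N_A \text{ plays of } A:&\quad
  \hat{P}_A = \frac{N_A - L_A}{N_A}, \qquad
  \hat{P}_B = \gamma - \frac{N_A - L_A}{N_A},\\
\text{from } N_B \text{ plays of } B:&\quad
  \hat{P}_A = \gamma - \frac{N_B - L_B}{N_B}, \qquad
  \hat{P}_B = \frac{N_B - L_B}{N_B}.
\end{aligned}
```

Weighting the two estimates for machine k by the number of plays on which each is based gives $N_k \cdot \frac{N_k - L_k}{N_k} + N_j \left(\gamma - \frac{N_j - L_j}{N_j}\right) = N_k - L_k + (\gamma - 1) N_j + L_j$, which agrees with the closed form in Eq. (3).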

Each expected reward based on $N_A$ selections of A
and $N_B$ selections of B is given as follows,

$$Q'_k = N_k - L_k + (\gamma - 1) N_j + L_j. \qquad (3)$$

Here, j is B if k is A, or A if k is B. These expected
rewards $Q'_k$ are not the same as the learning rules of the
TOW, $Q_k$ in Eq. (1). However, the following difference
is directly used in the TOW,

$$Q_A - Q_B = (N_A - N_B) - (1 + \omega)(L_A - L_B). \qquad (4)$$

When we transform the expected rewards $Q'_k$ into

$$Q''_k = Q'_k / (2 - \gamma), \qquad (5)$$

we can obtain the difference

$$Q''_A - Q''_B = (N_A - N_B) - \frac{2}{2 - \gamma}(L_A - L_B). \qquad (6)$$

Comparing the coefficients of Eqs. (4) and (6), the two
differences are always equal when $\omega = \omega_0$ satisfies

$$\omega_0 = \frac{\gamma}{2 - \gamma}. \qquad (7)$$

Eventually, we can obtain the nearly optimal weighting
parameter $\omega_0$ in terms of $\gamma$.

This derivation means that the TOW has a learning rule
equivalent to that of a system that is able to update both
estimates simultaneously. The TOW can imitate a system that
determines its next move at time t + 1 by referring to the
estimate of each machine even if that machine was not
selected at time t, as if the two machines had been
selected simultaneously at time t. This unique feature of
the learning rule is one of the origins of the high
performance of the TOW.
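The equivalence claimed by this derivation can be checked numerically. The sketch below evaluates the TOW difference of Eq. (4) and the rescaled expected-reward difference of Eqs. (3), (5) and (6) with exact fractions; the function names, the value of $\gamma$ and the play counts are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def q_diff_tow(n_a, l_a, n_b, l_b, omega):
    """Q_A - Q_B as in Eq. (4)."""
    return (n_a - n_b) - (1 + omega) * (l_a - l_b)

def q_diff_expected(n_a, l_a, n_b, l_b, gamma):
    """(Q'_A - Q'_B) / (2 - gamma): Eq. (6) built from Eqs. (3) and (5)."""
    q_a = n_a - l_a + (gamma - 1) * n_b + l_b
    q_b = n_b - l_b + (gamma - 1) * n_a + l_a
    return (q_a - q_b) / (2 - gamma)

gamma = Fraction(11, 10)        # hypothetical value of P_A + P_B
omega_0 = gamma / (2 - gamma)   # Eq. (7)
counts = (7, 2, 5, 4)           # hypothetical (N_A, L_A, N_B, L_B)
print(q_diff_tow(*counts, omega_0) == q_diff_expected(*counts, gamma))
```

With any other value of $\omega$ the two differences disagree whenever $L_A \neq L_B$, which is why Eq. (7) fixes $\omega_0$ uniquely.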






















We carried out Monte Carlo simulations and confirmed
that the performance of the TOW with $\omega_0$ is
comparable to its best performance, i.e., that of the TOW
with $\omega_{\mathrm{opt}}$. Detailed descriptions of these
results will be presented elsewhere [14]. In addition, the
essence of the process described here can be generalized to
K-machine and M-player cases. All we need is the following
$\omega_0$:

$$\omega_0 = \frac{\gamma'}{2 - \gamma'}, \qquad (8)$$

$$\gamma' = P_{(M)} + P_{(M+1)}. \qquad (9)$$

Here, $P_{(M)}$ denotes the M-th highest reward probability.
In fact, for K-machine and M-player cases, we have designed
a physical decision-making device that achieves the overall
optimal state quickly and accurately [15].
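As a worked instance of Eqs. (8) and (9), the sketch below ranks the reward probabilities and computes the weighting parameter. The function name `omega_0_general` is hypothetical; the probabilities in the usage line are those of the example in Section 3 with M = 3 players.

```python
def omega_0_general(probabilities, num_players):
    """Eqs. (8)-(9): gamma' = P_(M) + P_(M+1), omega_0 = gamma' / (2 - gamma')."""
    if not 1 <= num_players < len(probabilities):
        raise ValueError("need 1 <= M < K")
    ranked = sorted(probabilities, reverse=True)
    # M-th and (M+1)-th highest reward probabilities (1-based ranks).
    gamma_prime = ranked[num_players - 1] + ranked[num_players]
    return gamma_prime / (2 - gamma_prime)

print(round(omega_0_general([0.03, 0.05, 0.1, 0.2, 0.9], 3), 2))  # 0.08
```

For two machines and one player the function reduces to Eq. (7), because $\gamma'$ is then simply $P_A + P_B$.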

2. THE TUG-OF-WAR BOMBE 

The decision-making device called the ‘Tug-of-War
(TOW) Bombe’ for 3 users (1, 2, and 3) and 5 channels
(A, B, C, D, and E) is illustrated in Figure 3. Coupled
cylinders are filled with two kinds of incompressible fluid
(red and blue). The red (bottom) fluid handles the
‘decision-making of a user’, while the blue (upper) one
handles the ‘interaction among users’. The channel
selection of each user at each iteration is determined by
the heights of the green adjusters (the fluid interface
levels): the channel with the highest interface is chosen.
When the movements of the red and blue adjusters stabilize
and reach equilibrium, the ‘tug-of-war principle’ holds in
the red fluid for each user. In other words, when one
interface goes up, the other four interfaces fall, and
efficient channel selections are attained. Simultaneously,
the ‘action-reaction law’ holds in the blue fluid (i.e., if
the interface level of user 1 goes up, the interface levels
of users 2 and 3 fall), which helps avoid collisions, so
that the TOW Bombe is able to search for an overall optimal
solution accurately and quickly.

The dynamics of the TOW Bombe are expressed as
follows:

[Eq. (10), the update of $Q_{(i,k)}(t)$ by the increment $\Delta Q_{(i,k)}(t)$, is not recoverable from the source text.]

$$X_{(i,k)}(t+1) = Q_{(i,k)}(t) - \frac{1}{N-1} \sum_{l \neq k} Q_{(i,l)}(t). \qquad (11)$$

Here, $X_{(i,k)}(t)$ denotes the height of the interface of
user i and channel k at iteration step t. If channel k is
chosen for user i at time t, $\Delta Q_{(i,k)}(t)$ is +1 or
$-\omega$ according to the result (rewarded or not).
Otherwise, it is 0.

Fig. 3 The TOW Bombe for 3 users and 5 channels.

In addition to the above-mentioned dynamics,
oscillations are added to $X_{(i,k)}$. These oscillations
are applied externally by controlling the blue and red
adjusters appropriately. In this paper, we show the cases
where the completely synchronized oscillations
$osc_{(i,k)}(t)$ are added to all the users,

$$osc_{(i,k)}(t) = A \sin(2\pi t/5 + 2\pi (k - 1)/5). \qquad (12)$$

Here, k = 1, ..., 5.
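Two properties of the oscillation in Eq. (12) are easy to verify numerically: at every step the five phase-shifted terms sum to zero, so the oscillation does not add net volume, and the pattern repeats every five steps. The sketch below assumes an amplitude A = 1, which the text does not specify, and the parameter `num_channels` is a hypothetical generalization of the constant 5 in Eq. (12).

```python
import math

def osc(k, t, amplitude=1.0, num_channels=5):
    """Eq. (12): oscillation added to channel k (1-based) at step t."""
    phase = 2 * math.pi * t / num_channels + 2 * math.pi * (k - 1) / num_channels
    return amplitude * math.sin(phase)

for t in range(5):
    values = [osc(k, t) for k in range(1, 6)]
    peak = max(range(1, 6), key=lambda k: osc(k, t))
    # The channel receiving the largest push changes from step to step.
    print(t, peak, abs(sum(values)) < 1e-9)
```

Because the phase depends only on t + k - 1, the same waveform visits every channel in turn, which gives each channel a periodic chance of being explored.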

Thus, the TOW Bombe operates only by applying, for each
user, an operation that raises or lowers the interface
level (by +1 or $-\omega$) according to the result (success
or failure of packet transmission), i.e., M operations in
total at every time step. After these operations, the
interface levels move according to the volume conservation
law, which computes the next selection for each user. In
each user’s selection, an efficient search is realized
owing to the ‘TOW principle’, which can obtain a solution
accurately and quickly in trial-and-error tasks. Moreover,
through the interaction between users via the blue fluid,
the ‘Nash equilibrium’ can be avoided, and the device
achieves the overall optimization called the ‘social
maximum’ [16].
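The loop described above can be sketched in software. The sketch rests on several loudly stated assumptions: the update of $Q_{(i,k)}$ used in place of Eq. (10) (own increment minus the mean increment of the other users on the same channel) is a guess motivated by the ‘action-reaction law’, not the authors' confirmed equation; a reward on a contested free channel goes to one colliding user chosen at random, which matches an even split only in expectation; and the amplitude A = 1 and the function name `tow_bombe` are hypothetical.

```python
import math
import random

def tow_bombe(probs, num_users, omega, steps, rng, amplitude=1.0):
    """Sketch of the TOW Bombe loop; the Q update is an ASSUMED form of Eq. (10)."""
    n, m = len(probs), num_users
    if n < 2 or m < 2:
        raise ValueError("need at least 2 channels and 2 users")
    q = [[0.0] * n for _ in range(m)]
    x = [[0.0] * n for _ in range(m)]
    scores = [0] * m
    for t in range(steps):
        # Each user picks the channel with the highest level plus oscillation (Eq. (12)).
        choice = []
        for i in range(m):
            level = [x[i][k] + amplitude * math.sin(2 * math.pi * (t + k) / n)
                     for k in range(n)]
            choice.append(max(range(n), key=level.__getitem__))
        # Channel k is free with probability P_k.
        free = [rng.random() < p for p in probs]
        dq = [[0.0] * n for _ in range(m)]
        for k in sorted(set(choice)):
            users = [i for i in range(m) if choice[i] == k]
            # Assumption: one colliding user, picked at random, gets the reward.
            winner = rng.choice(users) if free[k] else None
            for i in users:
                if i == winner:
                    dq[i][k] = 1.0
                    scores[i] += 1
                else:
                    dq[i][k] = -omega
        # Assumed Eq. (10): blue-fluid coupling between users on each channel.
        for i in range(m):
            for k in range(n):
                others = sum(dq[j][k] for j in range(m) if j != i) / (m - 1)
                q[i][k] += dq[i][k] - others
        # Eq. (11): red-fluid volume conservation within each user.
        for i in range(m):
            total = sum(q[i])
            for k in range(n):
                x[i][k] = q[i][k] - (total - q[i][k]) / (n - 1)
    return scores, x

scores, levels = tow_bombe([0.03, 0.05, 0.1, 0.2, 0.9], 3, 0.08, 1000, random.Random(0))
print(scores)
```

Under Eq. (11) the interface levels of each user always sum to zero, which is the software counterpart of volume conservation. Whether this sketch reproduces the scores reported in Section 3 depends on the assumed update and amplitude.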

3. RESULTS 

In order to show that the TOW Bombe avoids the Nash
equilibrium and regularly achieves overall optimization, we
consider the case where ($P_A$, $P_B$, $P_C$, $P_D$, $P_E$)
= (0.03, 0.05, 0.1, 0.2, 0.9) as a typical example. For
simplicity, only a part of the payoff tensor, which has 125
(= $5^3$) elements, is shown: the matrix elements for which
no user chooses the low-ranking channels A and B (Tables 2,
3, and 4). For each matrix element, the reward probabilities
are given in the order of users 1, 2, and 3.

‘Social maximum (SM)’ is a state in which the maximum
total reward is obtained by all the users. In this problem,
the social maximum corresponds to a ‘segregation state’ in
which the users choose the top three machines (C, D, E),
each user a different one; there exist six segregation
states, which are indicated by SM in the tables. On the
other hand, the Nash equilibrium (NE) is a state in which
all the users choose machine E independently of the others’
decisions; machine E gives the reward with the highest
probability when each user behaves in a selfish manner.

Table 2 Payoff matrix of the case where ($P_C$, $P_D$, $P_E$) = (0.1, 0.2, 0.9), user 3 chooses C.

Table 3 Payoff matrix of the case where ($P_C$, $P_D$, $P_E$) = (0.1, 0.2, 0.9), user 3 chooses D.

        2: C             2: D                2: E
1: C    .05, .05, .2     .1, .1, .1          .1, .9, .2 SM
1: D    .1, .1, .1       2/30, 2/30, 2/30    .1, .9, .1
1: E    .9, .1, .2 SM    .9, .1, .1          .45, .45, .2

Table 4 Payoff matrix of the case where ($P_C$, $P_D$, $P_E$) = (0.1, 0.2, 0.9), user 3 chooses E.

        2: C             2: D                2: E
1: C    .05, .05, .9     .1, .2, .9 SM       .1, .45, .45
1: D    .2, .1, .9 SM    .1, .1, .9          .2, .45, .45
1: E    .45, .1, .45     .45, .2, .45        .3, .3, .3 NE
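The payoff entries and the SM and NE labels can be checked by enumerating the whole payoff tensor. The sketch below assumes only the even-split collision rule stated in the Introduction; the names `payoffs` and `is_nash` are hypothetical, and exact fractions are used so that totals can be compared without rounding error.

```python
from fractions import Fraction
from itertools import product

P = {"A": Fraction(3, 100), "B": Fraction(5, 100), "C": Fraction(10, 100),
     "D": Fraction(20, 100), "E": Fraction(90, 100)}

def payoffs(selection):
    """Expected reward of each user; users on the same channel split it evenly."""
    return tuple(P[ch] / selection.count(ch) for ch in selection)

def is_nash(selection):
    """True if no single user can raise its own expected reward by switching."""
    for i, current in enumerate(payoffs(selection)):
        for ch in P:
            trial = selection[:i] + (ch,) + selection[i + 1:]
            if payoffs(trial)[i] > current:
                return False
    return True

# One entry per joint selection (user 1, user 2, user 3).
tensor = {s: payoffs(s) for s in product(P, repeat=3)}
best_total = max(sum(v) for v in tensor.values())
social_maxima = [s for s, v in tensor.items() if sum(v) == best_total]
nash = [s for s in tensor if is_nash(s)]

print(len(tensor), len(social_maxima), nash)
```

Individual table entries can be read off in the same way, for example `payoffs(("E", "D", "C"))` for the case in which user 3 chooses C.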

The performance of the TOW Bombe was evaluated by a
score: the number of rewards (coins) a user obtained in
1000 plays. In cognitive radio, the score corresponds to
the number of packets that have been transmitted
successfully. Figure 4 shows the scores of the TOW Bombe in
the typical example where ($P_A$, $P_B$, $P_C$, $P_D$,
$P_E$) = (0.03, 0.05, 0.1, 0.2, 0.9). Since 1000 samples
were used, each data set contains 1000 circles. Each circle
indicates the scores obtained by user i (horizontal axis)
and user j (vertical axis) for one sample. There are six
clusters in Figure 4. These clusters correspond to the
two-dimensional projections of the six segregation states,
implying overall optimization. The social maximum points
are given as follows: (score of user 1, score of user 2,
score of user 3) = (100, 200, 900), (100, 900, 200),
(200, 100, 900), (200, 900, 100), (900, 100, 200), and
(900, 200, 100). The TOW Bombe did not reach the Nash
equilibrium state (300, 300, 300).

Figure 5 shows sample averages of the scores over 1000
plays; both the average of each user’s score and that of
the total score of all the users are shown. We can see
that the average total score reaches 100 + 200 + 900 =
1200, which is the value of the social maximum, while
fairness is maintained in most cases. Here, we set the
parameter $\omega$ to 0.08 (Eqs. (8) and (9) were evaluated
with $\gamma' = P_B + P_C$).

Fig. 4 Scores of the TOW Bombe in the case where ($P_A$, $P_B$, $P_C$, $P_D$, $P_E$) = (0.03, 0.05, 0.1, 0.2, 0.9).

Fig. 5 Sample averages of the scores of the TOW Bombe in the case where ($P_A$, $P_B$, $P_C$, $P_D$, $P_E$) = (0.03, 0.05, 0.1, 0.2, 0.9).

4. CONCLUSION AND DISCUSSION 

We demonstrated that an analog decision-making device,
called the TOW Bombe, can be implemented physically by
using two kinds of incompressible fluid in coupled
cylinders and that it achieves overall optimization in the
channel allocation problem in cognitive radio. The TOW
Bombe makes it possible to solve the allocation problem for
M users and N channels by repeating M up-and-down
operations of the fluid interface levels in the cylinders
at each iteration; it does not require the calculation of
the exponentially many ($O(N^M)$) evaluation values that
are required when using a conventional digital computer.
This suggests that an advantage of analog computation does
exist even in today’s digital age.

The TOW Bombe can also be implemented on the basis of
quantum physics. In fact, we have exploited optical energy
transfer dynamics between quantum dots to construct the
decision-making device [17,18]. Our method may be
applicable not only to the class of problems derived from
cognitive radio but also to broader varieties of game
payoff matrices, implying that wider applications can be
expected. We will report these observations and results
elsewhere in the future.